A Probabilistic Approach for Discovering Authoritative Web Pages
Abstract
The World Wide Web (WWW) is becoming the most important system for delivering information. Search services on the WWW are becoming increasingly popular among users because the huge amount of available data makes it difficult to retrieve and filter. Several works [2, 3] have argued that traditional Term-Based search engines are not very useful, since the resulting ranking depends on how precisely the user expresses the query. Users, however, are usually unclear about the information they need and so do not give much thought to query formulation. Moreover, if the query pertains to topics that are abundant on the Web, search services become unusable because of the huge number of pages returned. For instance, at the time of this work, AltaVista returned more than 18,000,000 pages in reply to a query asking for documents related to the word “java”.
Current research instead takes a different approach, which goes under the name of Topic Distillation on the Web. It basically consists in finding documents that are related to the query topic but do not necessarily contain the query string. The following classical example shows the advantage of this [2]. If we want to find Web pages associated with the query string “search engine”, a Term-Based search engine is not useful because it does not return pages like www.yahoo.com, www.altavista.com or www.excite.com. This happens because none of the really interesting pages contains the query string. The purpose of Topic Distillation is to increase the precision of the search algorithm so that it returns the most relevant pages, even if there is no trace of the query string in them. To achieve this, Kleinberg [2] observed that there is an additional source of information that can be used for searching the Web: its structure, made of nodes (Web pages) connected through arcs (links among pages).
Using this idea, he proposed a connectivity analysis algorithm based on a mutual reinforcement approach. In fact, a link that is not made for a navigational purpose encapsulates a human judgment on the relevance of the page with respect to a certain topic. As a consequence, all pages can be divided into two groups: hubs and authorities. An authority is a relevant page pointed to by many hubs, while a hub is a page that points to many relevant ones. Kleinberg’s algorithm tries to compute an authority score for each page as an indicator of relevance. The algorithm works in two main steps. In the first, given a query string, it computes a base set consisting of pages that could potentially be relevant for the user. In the second, it obtains the authorities by applying an iterative procedure to the base set. A second important aspect is the automatic discovery of communities connected to a given topic (Topic Enumeration). Topic search and enumeration are tightly related, since when searching for documents on a given topic it is useful both to compute the most authoritative documents and to identify the different communities. In this paper, we present a technique which, by exploiting the graph structure of the Web, improves the quality of both topic search and enumeration. Our technique is based on the application of a statistical approach to the co-citation matrix (associated with the base set obtained in the first step of the Mutual Reinforcement Approach) to find the most co-cited pages. The different communities are then derived directly from the co-citation matrix (also called the similarity matrix), and the most relevant pages are those which are the “most similar” to all the other pages in the same community.
Work partially supported by MURST grants under the projects “DataX” and “D2I”. The second author is also supported by ISI-CNR.
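The mutual reinforcement step attributed to Kleinberg above can be illustrated with a minimal sketch. It assumes the base set is already available as a small link graph. The function name `hits`, the fixed iteration count, the Euclidean normalisation and the toy graph are all illustrative assumptions, not details taken from the abstract.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    `links` maps each page to the list of pages it links to.
    This is an illustrative sketch, not the paper's implementation.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in set(targets):
                auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in set(links.get(p, ()))) for p in pages}

        # Normalise so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    return hub, auth


# Hypothetical toy base set: three hub-like pages pointing at two candidates.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1"]}
hub_scores, auth_scores = hits(toy_links)
print(sorted(auth_scores, key=auth_scores.get, reverse=True)[:2])
```

In this toy graph the page cited by all three hubs ends up with the highest authority score, and the hub that points to both candidates ends up with the highest hub score.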
The proposed technique is more general and efficient than the previous techniques proposed in the literature since it is able to identify the different communities and the most authoritative pages without the use of any iterative procedure. We point out that techniques for topic distillation can be applied to Web data (e.g. HTML data) and, generally, to data which can be modelled by graphs such as XML and semistructured data.
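To make the notion of a co-citation matrix concrete, here is a minimal sketch that counts, for each pair of pages, how many pages in a base set link to both. The abstract does not specify the statistical approach applied to this matrix, so the row-sum score below is only a naive, non-iterative stand-in and not the paper's method. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cocitation(links):
    """Count, for each unordered pair of pages, how many pages cite both.

    `links` maps each citing page to the list of pages it links to.
    """
    counts = defaultdict(int)
    for targets in links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return dict(counts)


def naive_scores(counts):
    """Sum each page's co-citation counts (an assumed, simplistic score)."""
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        scores[a] += n
        scores[b] += n
    return dict(scores)


# Hypothetical toy base set.
toy_links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a1", "a3"],
}
pair_counts = cocitation(toy_links)
print(pair_counts)
print(naive_scores(pair_counts))
```

Both functions make a single pass over their input, which is the sense in which a score of this kind needs no iterative procedure.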
Similar resources
The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics is mapped o...
Learning to Probabilistically Identify Authoritative Documents
We describe a model of document citation that learns to identify hubs and authorities in a set of linked documents, such as pages retrieved from the world wide web or papers retrieved from a research paper archive. Unlike the popular HITS algorithm, which relies on dubious statistical assumptions, our model provides probabilistic estimates that have clear semantics. We also find that in general the ...
Discovering task-oriented usage pattern for web recommendation
Web transaction data usually convey users' task-oriented behaviour patterns. Web usage mining techniques are able to capture such informative knowledge about user task patterns from usage data. With the discovered usage pattern information, it is possible to recommend more preferred content or a customized presentation to Web users according to the derived task preference. In this paper, we propose a Web r...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often gives undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Discovering Test Set Regularities in Relational Domains
Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the trainin...
Publication date: 2001